XML for the absolute beginner
A guided tour from HTML to processing XML with Java

Page 4 of 10
Make up a markup
While a
well-formed document is well-formed because it follows rules defined by
the XML spec, a valid document is valid because it matches its document
type definition (DTD). The DTD is the grammar for a markup language,
defined by the designer of the markup language. For my little XML recipe
in Listing 3, for example, that designer would be me. The DTD specifies
what elements may exist, what attributes the elements may have, what
elements may or must be found inside other elements, and in what order.
Nonvalidating parsers read the XML and, if it's well-formed,
give you back the document structure as a tree of objects. We'll discuss
the document structure you get from a parser in the section below entitled
"The Document Object Model." If the document is well-formed but the
elements are nonsensical (as was the case with the two
<Qty> elements in the <Ingredient>
above), that's your problem.
This is, in fact, how HTML browsers work. Generally, HTML parsers are
nonvalidating. The various "HTML checking" parsers, which report syntax
errors in HTML, are essentially validating HTML parsers (with additional
functionality, like link checking).
Validating parsers read XML, verify that it's well-formed
(just as nonvalidating parsers do), and then go on to determine
whether the document's element tags are legal, whether the attribute names
make sense, whether every element nested inside another element belongs
there, and so on.
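The difference between the two kinds of parser can be sketched in Java. This is only an illustration under stated assumptions: it uses the JAXP API (javax.xml.parsers) bundled with current JDKs rather than any parser named in this article, and the class name ValidationDemo, the sample ingredient, and the small inline DTD are made up for the example. The document has two <Qty> elements in one <Ingredient>, so it is well-formed but not valid.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class; not part of the original article.
final class ValidationDemo {

    // Well-formed, but invalid: the inline DTD allows exactly one <Qty>
    // followed by one <Item> inside each <Ingredient>.
    static final String TWO_QTY =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE Ingredient [\n"
        + "  <!ELEMENT Ingredient (Qty, Item)>\n"
        + "  <!ELEMENT Qty (#PCDATA)>\n"
        + "  <!ATTLIST Qty unit CDATA #REQUIRED>\n"
        + "  <!ELEMENT Item (#PCDATA)>\n"
        + "]>\n"
        + "<Ingredient>"
        + "<Qty unit=\"cup\">1</Qty>"
        + "<Qty unit=\"cup\">2</Qty>"
        + "<Item>beans</Item>"
        + "</Ingredient>";

    // Parses the XML and returns the validity errors the parser reported.
    // Well-formedness problems are fatal and surface as an exception instead.
    static List<String> parse(String xml, boolean validating) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating); // true = check against the DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored here */ }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Document is not well-formed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("nonvalidating: " + parse(TWO_QTY, false));
        System.out.println("validating:    " + parse(TWO_QTY, true));
    }
}
```

With validation switched off, the parser builds the tree and reports nothing; with validation on, the extra <Qty> is reported through the error handler. The exact wording of the message depends on the parser implementation.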
The DTD defines the document type. It accounts for the
Extensible in XML. The DTD is how you actually define a new
markup language -- what I often call a dialect of XML. DTDs currently are
being written for an enormous number of different problem domains, and
each DTD defines a new markup language. New markup languages now exist, or
are being designed, to mark up the plays of Shakespeare; to define general
data resources (RDF); to model information in the health care industry
(HL7 SGML/XML); to typeset, display, and actively use mathematical
equations (MathML); and to perform electronic data interchange (XML/EDI).
There's even a proposal for a markup language for business data in the
footwear industry (FDX). (No, I'm not joking.)
Central to each of these new languages is a DTD that describes what
tags the markup language has, what those tags' attributes may be, and how
they may be combined. A DTD specifies very clearly what information may or
may not be included in a markup language. For instance, the DTD for HTML
does not allow for markup tags to select paper size for printing.
Let's take a look at a DTD for the recipe XML in Listing 3. I'm going
to call it JWSRML (JavaWorld Scary Recipe Markup Language).
Apologies to anyone already using that acronym.
<!-- This is the example DTD for the example XML -->
<!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Description (#PCDATA)>
<!ELEMENT Ingredients (Ingredient)*>
<!ELEMENT Ingredient (Qty, Item)>
<!ELEMENT Qty (#PCDATA)>
<!ATTLIST Qty unit CDATA #REQUIRED>
<!ELEMENT Item (#PCDATA)>
<!ATTLIST Item optional CDATA "0"
               isVegetarian CDATA "true">
<!ELEMENT Instructions (Step)+>
Listing 4. The DTD for JWSRML
The document type definition in Listing 4 defines a language for a
validating parser to accept -- meaning, the parser will produce errors if
the rules listed in the DTD aren't followed. To get a general idea of how
a DTD works, let's look at what a few of the lines in this file mean.
<!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
The <!ELEMENT...> statement defines a tag in the
document. This statement defines a <Recipe> tag, stating
that it must contain a <Name>, followed by an optional
<Description> (the question mark [?] denotes
optionality), an optional <Ingredients> tag, and an
optional <Instructions> tag, in that order.
<!ELEMENT Name (#PCDATA)>
This simply states that a <Name> tag can contain character
data and nothing else.
<!ATTLIST Item optional CDATA "0" isVegetarian CDATA "true">
This section states that the <Item> tag has two possible
attributes: optional, whose default value is 0; and
isVegetarian, whose default value is true.
Notice that attribute values aren't limited to numbers; they can be any
text.
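The effect of attribute defaults can be seen from Java. The sketch below is an assumption-laden illustration, not the article's own code: it uses the JAXP and DOM APIs from a current JDK, the class name AttributeDefaultsDemo and the sample item are invented, and the <Item> declarations from Listing 4 are copied into an inline DTD so the example is self-contained. The document sets optional explicitly and leaves isVegetarian out, so the parser fills that one in from the DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original article.
final class AttributeDefaultsDemo {

    // The <Item> declarations from Listing 4, placed inline in the document.
    static final String ITEM_XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE Item [\n"
        + "  <!ELEMENT Item (#PCDATA)>\n"
        + "  <!ATTLIST Item optional CDATA \"0\"\n"
        + "                 isVegetarian CDATA \"true\">\n"
        + "]>\n"
        + "<Item optional=\"1\">croutons</Item>";

    // Parses the sample document and returns its <Item> root element.
    static Element parseItem() {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(ITEM_XML)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample item", e);
        }
    }

    public static void main(String[] args) {
        Element item = parseItem();
        // Written in the document, so the explicit value wins over the default.
        System.out.println("optional     = " + item.getAttribute("optional"));
        // Absent from the document, so the value comes from the DTD default.
        System.out.println("isVegetarian = " + item.getAttribute("isVegetarian"));
        // The DOM also records whether a value was written in the document.
        System.out.println("isVegetarian specified? "
            + item.getAttributeNode("isVegetarian").getSpecified());
    }
}
```

Both values come back as plain strings ("1" and "true"), which matches the point above: CDATA attributes carry text, and any interpretation as a number or a boolean is left to the application.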
A DTD is associated with an XML document by way of a document type
declaration, which appears at the top of the XML file (after the
<?xml...?> line). The document type declaration may
contain either an inline copy of the document type definition or a
reference to that definition as a system filename or URI (uniform resource
identifier). For example,
<!DOCTYPE Recipe SYSTEM "example.dtd">
tells the parser to start looking for a <Recipe> tag
as the top-level tag of the document. It also declares that the DTD is in
the system file example.dtd.
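How a parser follows that reference can be sketched in Java as well. Again, this is an illustration under assumptions rather than a definitive recipe: it uses the JAXP API from a current JDK, the class name DoctypeDemo is invented, and the DTD text is a cut-down, hypothetical stand-in for example.dtd that is served from a string through an EntityResolver so that no file needs to exist on disk.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class; not part of the original article.
final class DoctypeDemo {

    // Cut-down stand-in for the contents of example.dtd (an assumption).
    static final String DTD_TEXT =
          "<!ELEMENT Recipe (Name, Description?)>\n"
        + "<!ELEMENT Name (#PCDATA)>\n"
        + "<!ELEMENT Description (#PCDATA)>\n";

    static final String RECIPE_XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE Recipe SYSTEM \"example.dtd\">\n"
        + "<Recipe><Name>Sample recipe</Name></Recipe>";

    // Parses RECIPE_XML with validation on, adding validity errors to 'errors'.
    static Document parse(List<String> errors) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // The parser may expand the system identifier to an absolute URI,
            // so match on the end of the string. Returning null falls back
            // to the parser's default lookup.
            builder.setEntityResolver((publicId, systemId) ->
                systemId != null && systemId.endsWith("example.dtd")
                    ? new InputSource(new StringReader(DTD_TEXT))
                    : null);
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored here */ }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            return builder.parse(new InputSource(new StringReader(RECIPE_XML)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample recipe", e);
        }
    }

    public static void main(String[] args) {
        List<String> errors = new ArrayList<>();
        Document doc = parse(errors);
        System.out.println("declared root: " + doc.getDoctype().getName());
        System.out.println("actual root:   " + doc.getDocumentElement().getTagName());
        System.out.println("validity errors: " + errors);
    }
}
```

In a real application the resolver is usually unnecessary, because the parser reads example.dtd relative to the location of the XML file; it is used here only to keep the sketch self-contained.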
There are other characters and notations in the DTD, but writing DTDs
is a topic unto itself. If you're interested in learning more, check out
the DTD-related links in Resources.
You now know a lot about how XML is structured and controlled, but you
haven't heard what it's good for. Why are people so excited about this
technology?
Resources
There are so
many XML resources on the Web, I've had to categorize. The first section
here is the most useful, since the documents are either high-level
summaries or excellent link sites. Apologies to anyone who was omitted.
XML and Java: General XML resources
- "XML, Java and the Future of the Web," Jon Bosak. The paper that
started it all, at least from a Java programmer's point of view.
Definitely worth a read, even if it's a bit dated. Jon is commonly
considered to be the father of XML. Funny how all of these technologies
seem to have paternity:
http://metalab.unc.edu/pub/sun-info/standards/xml/why/xmlapps.html
- "Media-Independent Publishing: Four Myths about XML" Jon Bosak:
http://metalab.unc.edu/pub/sun-info/standards/xml/why/4myths.htm
- Robin Cover's XML-SGML site is, according to my SGML buddies, the
bible of XML resources:
http://www.oasis-open.org/cover/
- The W3C's XML resource page lets you cheer from the sidelines as XML
technology proposals develop into recommendations, or join in the fray
on their active mailing lists:
http://www.w3.org/XML/
- OASIS, the Web site of the Organization for the Advancement of
Structured Information Standards, offers general news and information
about XML:
http://www.oasis-open.org/
- The Graphic Communications Association, host of the XTech '99
conference (March 11 to 13, 1999, San Jose, CA) and the upcoming XML
Europe '99 conference in Granada, Spain, (April 26 to 30, 1999) has a
Web site packed with XML information:
http://www.gca.org/
- XML.com is great for watching trends and digging up XML news:
http://www.xml.com/
- Textuality hosts Tim Bray's site. Check it out for a look at the
"big picture" of how XML fits into the structured document universe --
and for a look at Lark, Tim's nonvalidating XML processor:
http://www.textuality.com/
- The XML FAQ:
http://www.ucc.ie/xml/
- IBM's XML Website is an outstanding supplement to alphaWorks:
http://www.software.ibm.com/xml/index.html
XML and Java
- "XML and Java: The Perfect Pair" by Ken Sall (Internet.com, November
1998) provides information about XML, Java, and why these two are a
match made in heaven:
http://wdvl.com/Authoring/Languages/XML/Java/index.html
Tutorials and training
- Generally Markup, Richard Lander's Web site, may be of interest to
you if you haven't yet read enough about markup languages:
http://pdbeam.uwaterloo.ca/~rlander/
- The Mulberry Technologies Web site is a good resource for commercial
training in XML, as well as general XML and SGML consulting by seasoned
SGML experts:
http://www.mulberrytech.com/
- The Web Developer's Virtual Library Series on XML offers good
summaries of various XML technologies, as well as annotated indices of
XML software:
http://wdvl.com/Software/XML
- Microsoft's Site Builder Network provides a series of articles
called "Extreme XML," one of which appears in the following link. While
some of it focuses on Microsoft-only, Windows-only technology, there's
still some great stuff here:
http://www.microsoft.com/sitebuilder/magazine/xml.asp
- Webmonkey has a good series of articles introducing readers to XML.
The index is at:
http://www.hotwired.com/webmonkey/xml/?tw=xml
- "What the ?xml!" by L.C. Rees offers an interesting take on XML and
why it's necessary -- nicely written and entertaining to boot:
http://www.geocities.com/SiliconValley/Peaks/5957/wxml.html
- "The XML Revolution" by Dan Connolly is a quick backgrounder on XML
(Nature):
http://helix.nature.com/webmatters/xml.html
Cascading Style Sheets
- W3C's CSS page will get you started learning about CSS:
http://www.w3.org/Style/CSS/
- "Cascading Style Sheets: Designing for the Web" by Håkon Wium Lie and
Bert Bos (Addison-Wesley, 1997). Sample chapters from the book appear at:
http://www.awl.com/cseng/titles/0-201-41998-X/liebos/
Extensible Style Language (XSL)
- The W3C's XSL page:
http://www.w3.org/Style/XSL/
- Read (and comment on) the W3C's XSL Working Draft (currently dated
December 16, 1998):
http://www.w3.org/TR/WD-xsl
- "The Extensible Style Language: Styling XML Documents"
(WebTechniques Magazine) XSL tutorial information and examples:
http://www.webtechniques.com/features/1999/01/walsh/walsh.shtml
- Microsoft's XML and XSL tutorial site is especially interesting
because of the recent release of client-side XSL in Internet Explorer
5.0. Extensive and excellent:
http://www.microsoft.com/xml
- If you're still using IE 4.0, you can still experiment with XML,
using Microsoft's internal DOM:
http://www.microsoft.com/xml/articles/xmlmodel.asp
- If you want to experiment with XSL, try downloading IBM's LotusXSL.
It's all Java, and for the time being, it's free:
http://www.alphaworks.ibm.com/tech/LotusXSL
- Or, you can try James Clark's XT XSL engine, downloadable from:
http://www.jclark.com/xml/xt.html
Upcoming XSL contest
Though the details aren't yet worked out, Sun Microsystems will soon
announce a call for proposals for a $30,000 grant to develop a
client-side processor for full XSL implementation in Mozilla.
It will also announce, in conjunction with Adobe, a contest (first prize
$40,000, second prize $20,000) to develop a pure-Java, server-side
processor of the entire XSL language, to format XML to PDF (Adobe's
document format). Keep watching the Java Developer Connection (requires
free registration), and Mozilla sites for the eventual announcements.
- "XTech '99: Java and the XML wave" by Mark Johnson
(JavaWorld, April 1999) offers the most current information on
the contest:
http://www.javaworld.com/javaworld/jw-04-1999/jw-04-xtech.html
Simple API for XML (SAX)
- The definitive description of SAX is available online. You can also
download free SAX software here:
http://www.megginson.com/SAX/index.html
Document Object Model (DOM)
- The W3C information page for the Document Object Model appears on
the W3C site:
http://www.w3c.org/DOM/
- Among other things, you'll find the W3C Recommendation for DOM Level
1:
http://www.w3.org/TR/REC-DOM-Level-1/
- The Java bindings for DOM, for both XML and HTML, are in this
Recommendation appendix:
http://www.w3.org/TR/REC-DOM-Level-1/java-language-binding.html
- A great DOM tutorial by William Robert Stanek appears on PC
Magazine Online in "Object-Based Web Design." This tutorial
includes a discussion of using DOM with IDL, CORBA's Interface
Definition Language:
http://www8.zdnet.com/pcmag/pctech/content/17/13/tf1713.001.html
Dynamic HTML
- The Dynamic HTML Resource page contains several links to DHTML
articles:
http://www.hotwired.com/webmonkey/dynamic_html/?tw=dynamic_html
Software
- Epicentric, Inc.:
http://www.epicentric.com/
- More XML (and other Java) technology than you can shake a stick at
is available at IBM's alphaWorks:
http://alphaworks.ibm.com/
- Version 2 of IBM's excellent XML parser package, xml4j, is available
for download. This package includes several parsers, both validating and
nonvalidating:
http://www.alphaworks.ibm.com/tech/xml4j
- See also IBM's exciting Bean Markup Language project, which uses XML
to represent and manipulate JavaBeans:
http://www.alphaworks.ibm.com/tech/bml
- Another free Java XML parser was written by the indefatigable James
Clark. Download it at:
http://www.jclark.com/xml/xp/index.html
- XEENA is IBM alphaWorks's DTD-guided XML editor. You want it, you
need it, you gotta have it:
http://www.alphaworks.ibm.com/tech/xeena
- Mozilla.org is the open source community's effort to extend the
Netscape source code. Find out about it at:
http://www.mozilla.org/
- Information about XML and CSS in Mozilla appears at:
http://www.mozilla.org/rdf/doc/xml.html
- You can read about Sun's XML and Java initiatives at:
http://www.sun.com/990310/java_xml.jhtml
- In addition, Java Project X includes source code downloadable from:
http://developer.java.sun.com/developer/earlyAccess/xml/index.html
- ArborText has a suite of sophisticated tools for editing SGML, XML,
and XSL:
http://www.arbortext.com/Products/products.html
- Oracle8i from Oracle Corporation uses XML inside the Oracle core:
http://www.oracle.com/xml/
- Download Oracle's free XML for Java parser:
http://technet.oracle.com/direct/3xml.htm
- Microsoft's Internet Explorer 5.0, released this month, implements
part of the XSL spec. You can find it on Microsoft's Web site -- and
also just about anywhere else:
http://www.microsoft.com/windows/ie/default.htm
- You can also download a beta release of Microsoft's XML Notepad
editor (limited to running only on Microsoft Windows):
http://www.microsoft.com/xml/notepad/download.asp
- Vervet Logic of Bloomington, IN, has announced XML <PRO>, a
commercial XML editor:
http://www.vervet.com/
- Majix, to transform XML to HTML via XSL, is available at:
http://www.tetrasix.com/
- If your French is rusty, you might want to try the English-language
site at:
http://www.tetrasix.com/english/default.htm
History
- Read about the history of HTML here. It's part of an online book, so
there's no telling for how long it will be available:
http://ei.cs.vt.edu/~wwwbtb/hardcopy/book/chap4/origins.html
The two chapters listed below (from the book "HTML Unleashed" by Rick
Darnell, et al.) also cover some of the technical background of these
languages.
- SGML history:
http://www.webreference.com/dlab/books/html/3-2.html
- XML history (such as it is):
http://www.webreference.com/dlab/books/html/38-0.html
- Nothing to do on Friday night? Why not read up on the history of
SGML? Charles Goldfarb, considered by many to be the "father of SGML,"
reminisces publicly at:
http://www.sgmlsource.com/Goldfarb/history/index.htm
- Useful XML and SGML information appears at Goldfarb's Web site,
including a comprehensive XML book list:
http://www.sgmlsource.com/
Miscellaneous links
- Uche Ogbuji has written an interesting article in
LinuxWorld about using XML on Linux in the Enterprise. It's at:
http://www.linuxworld.com/linuxworld/lw-1999-03/lw-03-xml.html
- Bluestone Software has recently made a splash with pure-Java XML
application servers, and a freely downloadable Swing package called
XwingML:
http://www.bluestone.com/
- Everyone (except Microsoft) is pretty freaked out about the US
Patent Office awarding Microsoft a patent for certain kinds of
functionality in style sheets. What happens with this patent, and its
impact on developing technology, remains to be seen. Judge for yourself
by reading the patent at:
http://www.patents.ibm.com/patlist?icnt=US&patent_number=5860073
- The title of the sample recipe is actually the title of a very funny
song by William Bolcom. Similar recipes may be found at:
http://www.b4uby.com/granny/gsoup.htm
- The song appears on a compact disc (with other odd songs) available
from the Public Radio Music Source at:
http://75music.org/best/docs/keepers.htm